{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Autoscaling a service with Amazon SageMaker\n",
    "\n",
    "This notebook shows how to use reinforcement learning (RL) to address a very common problem in operating production software systems: scaling a service by adding and removing resources (e.g. servers or EC2 instances) in reaction to dynamically changing load. The example is a simple toy that demonstrates how one might begin to address this real and challenging problem. We build a simulated system whose load has daily and weekly variations and occasional spikes. The system also has a delay between when new resources are requested and when they become available for serving requests. The custom environment is built with OpenAI Gym, and the RL agents are trained using Amazon SageMaker."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem Statement\n",
    "\n",
    "Autoscaling enables services to adjust capacity up or down automatically depending on conditions you define. Today, this requires setting up alarms, scaling policies, thresholds, etc. With the custom simulator, the RL problem for autoscaling can be defined as:\n",
    "\n",
    "1. *Objective*: Optimize the profit of a scalable web service by adapting instance capacity to the load profile, while ensuring that there are enough servers/instances when a spike occurs.\n",
    "2. *Environment*: A custom environment that includes the load profile. It generates a simulated load with daily and weekly variations and occasional spikes. The simulated system has a delay between when new resources are requested and when they become available for serving requests.\n",
    "3. *State*: A time-weighted combination of previous and current observations. At each time step, an observation includes the current load (transactions this minute), the number of failed transactions, a Boolean flag indicating whether the service is in downtime (when availability drops below 99.5%), and the current number of active machines.\n",
    "4. *Action*: Remove or add machines. The agent can do both at the same time.\n",
    "5. *Reward*: A customized reward function based on a simple financial model. On top of positive reward for successful transactions, we take costs for running machines into consideration. We also apply a high penalty for downtime."
   ]
  },
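  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the reward in item 5 concrete, here is a minimal sketch of a financial-model reward. The function name `step_reward` and every constant (revenue per transaction, cost per machine, downtime penalty) are illustrative assumptions, not values taken from `src/autoscalesim.py`.\n",
    "\n",
    "```python\n",
    "def step_reward(successful_tx, active_machines, is_down,\n",
    "                revenue_per_tx=0.001, cost_per_machine=0.05,\n",
    "                downtime_penalty=10.0):\n",
    "    # All default constants are hypothetical placeholders.\n",
    "    revenue = successful_tx * revenue_per_tx            # income from served requests\n",
    "    running_cost = active_machines * cost_per_machine   # cost of machines this step\n",
    "    penalty = downtime_penalty if is_down else 0.0      # large penalty for downtime\n",
    "    return revenue - running_cost - penalty\n",
    "```\n",
    "\n",
    "With a penalty much larger than the per-step revenue, an agent trained on a reward of this shape would be pushed to keep spare capacity for spikes rather than to minimize the machine count alone."
   ]
  },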
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Amazon SageMaker for RL\n",
    "\n",
    "Amazon SageMaker allows you to train your RL agents on cloud machines using Docker containers. You do not have to worry about setting up your machines with the RL toolkits and deep learning frameworks. You can easily switch between many different machines set up for you, including powerful GPU machines that give a big speedup. You can also choose to use multiple machines in a cluster to further speed up training, which is often necessary for production-level loads."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pre-requisites\n",
    "\n",
    "### Roles and permissions\n",
    "\n",
    "To get started, we'll import the Python libraries we need and set up the environment with a few prerequisites for permissions and configurations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "import boto3\n",
    "import sys\n",
    "import os\n",
    "import glob\n",
    "import re\n",
    "import subprocess\n",
    "from IPython.display import HTML\n",
    "import time\n",
    "from time import gmtime, strftime\n",
    "sys.path.append(\"common\")\n",
    "from misc import get_execution_role, wait_for_s3_object\n",
    "from sagemaker.rl import RLEstimator, RLToolkit, RLFramework"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up S3 buckets\n",
    "\n",
    "Set up the linkage and authentication to the S3 bucket that you want to use for checkpoints and metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# S3 bucket\n",
    "sage_session = sagemaker.session.Session()\n",
    "s3_bucket = sage_session.default_bucket()  \n",
    "s3_output_path = 's3://{}/'.format(s3_bucket) # SDK appends the job name and output folder\n",
    "print(\"S3 bucket path: {}\".format(s3_output_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define Variables \n",
    "\n",
    "We define variables such as the job name prefix for the training jobs. An image path for the container is only needed when you bring your own container (BYOC), which this notebook does not do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create unique job name \n",
    "job_name_prefix = 'rl-auto-scaling'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure settings\n",
    "\n",
    "You can run your RL training jobs on a SageMaker notebook instance or on your own machine. In both of these scenarios, you can run the following in either `local` or `SageMaker` modes. The `local` mode uses the SageMaker Python SDK to run your code in a local container before deploying to SageMaker. This can speed up iterative testing and debugging while using the same familiar Python SDK interface. You just need to set `local_mode = True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "# run in local mode?\n",
    "local_mode = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create an IAM role\n",
    "When running from a SageMaker notebook, get the execution role with `role = sagemaker.get_execution_role()`. When running from a local machine, use the utility method `role = get_execution_role()` (imported from `misc`) to create an execution role."
   ]
  },
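  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    # works when running on a SageMaker notebook instance\n",
    "    role = sagemaker.get_execution_role()\n",
    "except Exception:\n",
    "    # fall back to the helper in common/misc.py when running locally\n",
    "    role = get_execution_role()\n",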
    "    \n",
    "print(\"Using IAM role arn: {}\".format(role))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install docker for `local` mode\n",
    "\n",
    "In order to work in `local` mode, you need to have Docker installed. When running from your local machine, please make sure that you have docker or docker-compose (for local CPU machines) and nvidia-docker (for local GPU machines) installed. Alternatively, when running from a SageMaker notebook instance, you can simply run the following script to install the dependencies.\n",
    "\n",
    "Note that you can only run a single local notebook at a time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# only run from SageMaker notebook instance\n",
    "if local_mode:\n",
    "    !/bin/bash ./common/setup.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up the environment\n",
    "\n",
    "The environment is defined in a Python file called `autoscalesim.py`, located in the `src` directory.\n",
    "\n",
    "The environment implements the `__init__()`, `step()` and `reset()` methods that describe how the environment behaves. This is consistent with the OpenAI Gym interface for defining an environment.\n",
    "\n",
    "1. `__init__()` - initialize the environment in a pre-defined state\n",
    "2. `step()` - take an action on the environment\n",
    "3. `reset()` - restart the environment on a new episode\n",
    "4. [if applicable] `render()` - get a rendered image of the environment in its current state"
   ]
  },
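  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The interface described above can be sketched without any dependencies. The class below is a hypothetical stand-in, not the code in `src/autoscalesim.py`: the real environment would subclass `gym.Env` and declare its observation and action spaces, and all names and numbers here are assumptions chosen to keep the example deterministic. It follows the classic Gym convention in which `step()` returns `(observation, reward, done, info)`.\n",
    "\n",
    "```python\n",
    "class ToyScalingEnv:\n",
    "    # Hypothetical minimal environment mirroring the Gym method names.\n",
    "    def __init__(self, max_steps=5, capacity_per_machine=100):\n",
    "        self.max_steps = max_steps\n",
    "        self.capacity_per_machine = capacity_per_machine\n",
    "        self.reset()\n",
    "\n",
    "    def reset(self):\n",
    "        # Start a new episode with a single machine and no load.\n",
    "        self.t = 0\n",
    "        self.machines = 1\n",
    "        return self._observe(load=0)\n",
    "\n",
    "    def step(self, action):\n",
    "        # action: change in machine count (negative values remove machines)\n",
    "        self.machines = max(1, self.machines + action)\n",
    "        self.t += 1\n",
    "        load = 150  # constant load keeps the sketch deterministic\n",
    "        served = min(load, self.machines * self.capacity_per_machine)\n",
    "        reward = served - 20 * self.machines  # served requests minus machine cost\n",
    "        done = self.t >= self.max_steps\n",
    "        return self._observe(load), reward, done, {}\n",
    "\n",
    "    def _observe(self, load):\n",
    "        return (load, self.machines)\n",
    "\n",
    "\n",
    "env = ToyScalingEnv()\n",
    "obs = env.reset()\n",
    "obs, reward, done, info = env.step(1)  # add one machine\n",
    "```\n",
    "\n",
    "An agent interacts with the environment only through `reset()` and `step()`, which is why the same training code can drive very different simulators."
   ]
  },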
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize src/autoscalesim.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize the simulated load\n",
    "\n",
    "The shape of the simulated load is critical to an auto-scaling simulation. We use this toy load simulator for visualization. The simulator has two load components: periodic load and spikes. The periodic load is a simple daily cycle of fixed mean and amplitude, with multiplicative Gaussian noise. Spikes start instantly and decay linearly until gone, with a variable random delay between them."
   ]
  },
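  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the two components described above, the sketch below generates a per-minute load from a daily sine cycle with multiplicative Gaussian noise plus linearly decaying spikes. It is not the implementation of `autoscalesim.LoadSimulator`; the function name `toy_load` and all parameter values are assumptions, and spike arrivals are modeled here with a fixed per-minute probability.\n",
    "\n",
    "```python\n",
    "import math\n",
    "import random\n",
    "\n",
    "\n",
    "def toy_load(minutes, mean=100.0, amplitude=50.0, noise_std=0.05,\n",
    "             spike_height=200.0, spike_decay=5.0, spike_prob=0.002, seed=0):\n",
    "    # All parameter values are hypothetical.\n",
    "    rng = random.Random(seed)\n",
    "    spike = 0.0\n",
    "    load = []\n",
    "    for m in range(minutes):\n",
    "        # Daily cycle: one full sine period every 24 * 60 minutes.\n",
    "        periodic = mean + amplitude * math.sin(2 * math.pi * m / (24 * 60))\n",
    "        periodic *= rng.gauss(1.0, noise_std)  # multiplicative Gaussian noise\n",
    "        if rng.random() < spike_prob:\n",
    "            spike = spike_height  # a spike starts instantly\n",
    "        load.append(max(0.0, periodic) + spike)\n",
    "        spike = max(0.0, spike - spike_decay)  # linear decay until gone\n",
    "    return load\n",
    "\n",
    "\n",
    "one_day = toy_load(24 * 60)\n",
    "```\n",
    "\n",
    "Because the random generator is seeded, repeated calls with the same arguments return the same series, which makes such a generator convenient for debugging."
   ]
  },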
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# install OpenAI Gym if it is not already installed\n",
    "! pip install gym"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "sys.path.append('src')\n",
    "import autoscalesim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def xy_data(days_to_simulate=3):\n",
    "    loadsim = autoscalesim.LoadSimulator()\n",
    "    load = []\n",
    "    x = np.arange(0, days_to_simulate, 1.0/(24*60))\n",
    "    for t in x:\n",
    "        load.append(loadsim.time_step_load())\n",
    "    load = np.asarray(load)\n",
    "    return (x, load)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "plt.rcParams[\"figure.figsize\"] = (20,8)\n",
    "\n",
    "for n in range(5):  # Draw 5 plots\n",
    "    (x,y) = xy_data()\n",
    "    fig, ax = plt.subplots()\n",
    "    ax.plot(x, y)\n",
    "    ax.set(xlabel='time (days)', ylabel='load (tpm)',\n",
    "           title='Load simulation #%d' % n)\n",
    "    ax.grid()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure the presets for the RL algorithm\n",
    "\n",
    "The presets that configure the RL training jobs are defined in the `preset-autoscale-ppo.py` file, which is also located in the `src` directory. Using the preset file, you can define agent parameters to select the specific agent algorithm. You can also set the environment parameters, define the schedule and visualization parameters, and define the graph manager. The schedule presets define the number of heat-up steps, the periodic evaluation steps, and the training steps between evaluations.\n",
    "\n",
    "These can be overridden at runtime by specifying the `RLCOACH_PRESET` hyperparameter. Additionally, it can be used to define custom hyperparameters. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize src/preset-autoscale-ppo.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Write the Training Code \n",
    "\n",
    "The training code is written in the file `train-coach.py`, which is located in the `src` directory. \n",
    "First import the environment files and the preset files, and then define the `main()` function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize src/train-coach.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the RL model using the Python SDK Script mode\n",
    "\n",
    "If you are using local mode, the training will run on the notebook instance. When using SageMaker for training, you can select a GPU or CPU instance. The RLEstimator is used for training RL jobs. \n",
    "\n",
    "1. Specify the source directory where the environment, presets and training code are located.\n",
    "2. Specify the training code as the entry point.\n",
    "3. Specify the choice of RL toolkit and framework. This automatically resolves to the ECR path for the RL Container. \n",
    "4. Define the training parameters such as the instance count, instance type, job name, and S3 path for output.\n",
    "5. Specify the hyperparameters for the RL agent algorithm. The `RLCOACH_PRESET` can be used to specify the RL agent algorithm you want to use. \n",
    "6. [Optional] Choose the metrics that you are interested in capturing in your logs. These can also be visualized in CloudWatch and SageMaker Notebooks. The metrics are defined using regular expression matching. "
   ]
  },
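  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Step 6 above relies on regular expressions to pull metric values out of the training logs. The sketch below shows the mechanics with Python's `re` module. The log line and the regex are hypothetical, so inspect your own job's logs for the actual format; the list of `Name`/`Regex` dictionaries is the shape accepted by the estimator's `metric_definitions` argument.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Hypothetical metric definition: the first capture group is the metric value.\n",
    "metric_definitions = [\n",
    "    {'Name': 'reward-training', 'Regex': 'Total reward=([-0-9.]+)'},\n",
    "]\n",
    "\n",
    "# Hypothetical log line, made up for illustration only.\n",
    "log_line = 'Training> Episode=12, Total reward=345.6, Steps=1000'\n",
    "\n",
    "match = re.search(metric_definitions[0]['Regex'], log_line)\n",
    "reward = float(match.group(1))\n",
    "```\n",
    "\n",
    "Testing a regex against a sample log line in this way is a cheap check before launching a training job, since a pattern that never matches simply produces no metric."
   ]
  },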
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "if local_mode:\n",
    "    instance_type = 'local'\n",
    "else:\n",
    "    instance_type = \"ml.m4.xlarge\"\n",
    "        \n",
    "estimator = RLEstimator(entry_point=\"train-coach.py\",\n",
    "                        source_dir='src',\n",
    "                        dependencies=[\"common/sagemaker_rl\"],\n",
    "                        toolkit=RLToolkit.COACH,\n",
    "                        toolkit_version='0.11.0',\n",
    "                        framework=RLFramework.TENSORFLOW,\n",
    "                        role=role,\n",
    "                        train_instance_type=instance_type,\n",
    "                        train_instance_count=1,\n",
    "                        output_path=s3_output_path,\n",
    "                        base_job_name=job_name_prefix,\n",
    "                        hyperparameters = {\n",
    "                          \"RLCOACH_PRESET\": \"preset-autoscale-ppo\",\n",
    "                          \"rl.agent_params.algorithm.discount\": 0.9,\n",
    "                          \"rl.evaluation_steps:EnvironmentEpisodes\": 8,\n",
    "                          # save model for deployment\n",
    "                          \"save_model\": 1\n",
    "                        }\n",
    "                    )\n",
    "estimator.fit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Store intermediate training output and model checkpoints \n",
    "\n",
    "The output from the training job above is stored either in a local directory (`local` mode) or on S3 (`SageMaker` mode).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "job_name=estimator._current_job_name\n",
    "print(\"Job name: {}\".format(job_name))\n",
    "\n",
    "s3_url = \"s3://{}/{}\".format(s3_bucket,job_name)\n",
    "\n",
    "if local_mode:\n",
    "    output_tar_key = \"{}/output.tar.gz\".format(job_name)\n",
    "else:\n",
    "    output_tar_key = \"{}/output/output.tar.gz\".format(job_name)\n",
    "\n",
    "intermediate_folder_key = \"{}/output/intermediate/\".format(job_name)\n",
    "output_url = \"s3://{}/{}\".format(s3_bucket, output_tar_key)\n",
    "intermediate_url = \"s3://{}/{}\".format(s3_bucket, intermediate_folder_key)\n",
    "\n",
    "print(\"S3 job path: {}\".format(s3_url))\n",
    "print(\"Output.tar.gz location: {}\".format(output_url))\n",
    "print(\"Intermediate folder path: {}\".format(intermediate_url))\n",
    "    \n",
    "tmp_dir = \"/tmp/{}\".format(job_name)\n",
    "os.makedirs(tmp_dir, exist_ok=True)  # safe to re-run if the folder already exists\n",
    "print(\"Create local folder {}\".format(tmp_dir))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plot rate of learning\n",
    "\n",
    "We can view the rewards during training using the code below. This visualization helps us understand how the performance of the model, measured by the reward, has improved over time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import pandas as pd\n",
    "\n",
    "csv_file_name = \"worker_0.simple_rl_graph.main_level.main_level.agent_0.csv\"\n",
    "key = os.path.join(intermediate_folder_key, csv_file_name)\n",
    "wait_for_s3_object(s3_bucket, key, tmp_dir)\n",
    "\n",
    "csv_file = \"{}/{}\".format(tmp_dir, csv_file_name)\n",
    "df = pd.read_csv(csv_file)\n",
    "df = df.dropna(subset=['Training Reward'])\n",
    "x_axis = 'Episode #'\n",
    "y_axis = 'Training Reward'\n",
    "\n",
    "# Name the Axes `ax` so the earlier `plt` (matplotlib.pyplot) import is not shadowed\n",
    "ax = df.plot(x=x_axis, y=y_axis, figsize=(12,5), legend=True, style='b-')\n",
    "ax.set_ylabel(y_axis);\n",
    "ax.set_xlabel(x_axis);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation of RL models\n",
    "\n",
    "We use the latest checkpointed model to run evaluation for the RL Agent. \n",
    "\n",
    "### Load the checkpointed models \n",
    "\n",
    "Checkpointed data from the previously trained models will be passed on for evaluation / inference in the `checkpoint` channel. In `local` mode, we can simply use the local directory, whereas in the `SageMaker` mode, it needs to be moved to S3 first.\n",
    "\n",
    "Since the checkpoint files stored by TensorFlow contain absolute paths from when they were generated (see [issue](https://github.com/tensorflow/tensorflow/issues/9146)), we need to replace the absolute paths with relative paths. This is implemented within `evaluate-coach.py`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "wait_for_s3_object(s3_bucket, output_tar_key, tmp_dir)  \n",
    "\n",
    "if not os.path.isfile(\"{}/output.tar.gz\".format(tmp_dir)):\n",
    "    raise FileNotFoundError(\"File output.tar.gz not found\")\n",
    "os.system(\"tar -xvzf {}/output.tar.gz -C {}\".format(tmp_dir, tmp_dir))\n",
    "\n",
    "if local_mode:\n",
    "    checkpoint_dir = \"{}/data/checkpoint\".format(tmp_dir)\n",
    "else:\n",
    "    checkpoint_dir = \"{}/checkpoint\".format(tmp_dir)\n",
    "\n",
    "print(\"Checkpoint directory {}\".format(checkpoint_dir))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "if local_mode:\n",
    "    checkpoint_path = 'file://{}'.format(checkpoint_dir)\n",
    "    print(\"Local checkpoint file path: {}\".format(checkpoint_path))\n",
    "else:\n",
    "    checkpoint_path = \"s3://{}/{}/checkpoint/\".format(s3_bucket, job_name)\n",
    "    if not os.listdir(checkpoint_dir):\n",
    "        raise FileNotFoundError(\"Checkpoint files not found under the path\")\n",
    "    os.system(\"aws s3 cp --recursive {} {}\".format(checkpoint_dir, checkpoint_path))\n",
    "    print(\"S3 checkpoint file path: {}\".format(checkpoint_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the evaluation step\n",
    "\n",
    "Use the checkpointed model to run the evaluation step. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "estimator_eval = RLEstimator(role=role,\n",
    "                      source_dir='src/',\n",
    "                      dependencies=[\"common/sagemaker_rl\"],\n",
    "                      toolkit=RLToolkit.COACH,\n",
    "                      toolkit_version='0.11.0',\n",
    "                      framework=RLFramework.TENSORFLOW,\n",
    "                      entry_point=\"evaluate-coach.py\",\n",
    "                      train_instance_count=1,\n",
    "                      train_instance_type=instance_type,\n",
    "                      hyperparameters = {\n",
    "                                 \"RLCOACH_PRESET\": \"preset-autoscale-ppo\",\n",
    "                                 \"evaluate_steps\": 10001*2 # evaluate on 2 episodes\n",
    "                             }\n",
    "                    )\n",
    "estimator_eval.fit({'checkpoint': checkpoint_path})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hosting\n",
    "\n",
    "Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same instance (or type of instance) that we used to train. The endpoint deployment can be accomplished as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inference\n",
    "\n",
    "Now that the trained model is deployed at an endpoint that is up and running, we can use this endpoint for inference. The format of the input should match that of `observation_space` in the defined environment. In this example, the observation space is a 25-dimensional vector formulated from previous and current observations. To save space, this demo doesn't include the non-trivial construction process. Instead, we provide a dummy input below. For more details, please check `src/autoscalesim.py`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "observation = np.arange(1, 26)\n",
    "action = predictor.predict(observation)\n",
    "print(action)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete the Endpoint\n",
    "A running endpoint incurs costs. Therefore, as a clean-up step, we should delete the endpoint when we are done with it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor.delete_endpoint()"
   ]
  }
 ],
 "metadata": {
 "anaconda-cloud": {},
  "kernelspec": {
    "name": "conda_tensorflow_p36",
    "display_name": "conda_tensorflow_p36",
    "language": "python"
  },
  "language_info": {
    "name": "python",
    "version": "3.7.0",
    "mimetype": "text/x-python",
    "codemirror_mode": {
      "name": "ipython",
      "version": 3
    },
    "pygments_lexer": "ipython3",
    "nbconvert_exporter": "python",
    "file_extension": ".py"
  },
  "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
